Video Captioning with Guidance of Multimodal Latent Topics

Abstract

The topic diversity of open-domain videos leads to varied vocabularies and linguistic expressions in describing video contents, which makes the video captioning task even more challenging. In this paper, we propose a unified caption framework, M&M TGM, which mines multimodal topics from data in an unsupervised fashion and guides the caption decoder with these topics. Compared to pre-defined topics, the mined multimodal topics are more semantically and visually coherent and better reflect the topic distribution of videos. We formulate topic-aware caption generation as a multi-task learning problem, in which we add a parallel task, topic prediction, alongside the caption task. For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from the multimodal contents of videos. The topic prediction provides intermediate supervision to the learning process. For the caption task, we propose a novel topic-aware decoder that generates more accurate and detailed video descriptions with guidance from the latent topics. The entire learning procedure is end-to-end and optimizes both tasks simultaneously. Results from extensive experiments on the MSR-VTT and Youtube2Text datasets demonstrate the effectiveness of our proposed model. M&M TGM not only outperforms prior state-of-the-art methods on multiple evaluation metrics on both benchmark datasets, but also achieves better generalization ability.
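The multi-task formulation described in the abstract can be illustrated with a minimal sketch of a joint objective. This is not the authors' implementation: the abstract does not specify the loss functions, so the use of a KL divergence between teacher and student topic distributions, the weighting parameter `topic_weight`, and all function names below are assumptions made only for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def caption_nll(word_probs):
    """Mean negative log-likelihood of the reference caption words.

    word_probs holds the probability the decoder assigned to each
    ground-truth word (hypothetical input for this sketch).
    """
    return -sum(math.log(p) for p in word_probs) / len(word_probs)


def topic_kl(teacher, student, eps=1e-12):
    """KL(teacher || student) between two topic distributions.

    Assumption: the student is trained to match the mined (teacher)
    topic distribution via KL divergence.
    """
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher, student)
        if t > 0
    )


def joint_loss(word_probs, teacher_topics, student_logits, topic_weight=0.5):
    """Caption loss plus weighted topic-prediction loss.

    topic_weight is a hypothetical hyperparameter that balances the
    two tasks; both terms would be minimized together in training.
    """
    student_topics = softmax(student_logits)
    return caption_nll(word_probs) + topic_weight * topic_kl(
        teacher_topics, student_topics
    )
```

In this sketch, the topic term vanishes when the student's predicted topic distribution matches the teacher's, so only the caption term remains; a mismatch adds a penalty, which is one way the topic prediction could act as intermediate supervision.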